A Stereo Vision Lip Tracking Algorithm and Subsequent Statistical Analyses of the Audio-Video Correlation in Australian English

Authors

Roland Goecke

Bruce Millar

Alex Zelinsky

Abstract

Human perception of the world is inherently multi-sensory because the information provided is multimodal. The perception of spoken language is no exception. Besides the auditory information, there is also visual speech information, provided by the facial movements that result from moving the articulators during speech production. Visual speech information contributes to speech perception in all kinds of audio conditions, but its effect is perhaps most readily noticed in noisy audio conditions. In recent years, various research groups around the world have studied the effects of incorporating visual speech information in automatic speech recognition (ASR) systems. They have found that audio-video (AV) ASR systems achieve a higher recognition rate than audio-only systems, in particular in noisy audio conditions. Exactly how best to incorporate the additional visual speech information is still not known. This study aims to extend our knowledge of the relationships between audio and video speech parameters. It investigates ways of describing such relationships using statistical analyses and applies them to the example of Australian English (AuE). The work described in this thesis is multi-disciplinary. Apart from the statistical analyses, it also required algorithms to extract speech parameters and a corpus of AV speech sequences, neither of which was readily available. A novel non-intrusive automatic lip tracking algorithm is presented, which uses a stereo camera system to enable accurate 3D measurements of facial feature points without the need for artificial markers on the face. Due to the lack of an AV speech corpus for AuE, a new modular framework for AV speech corpora was developed and followed in a newly created corpus for AuE. With these tools, it was possible to test the hypothesis that combinations of audio and video speech parameters, rather than single parameters, are related, and that these combinations are phoneme-specific.
Based on articulatory theory, it is clear that the audio and video domains are related in some way and to some

Similar resources

Stereo Vision Lip-tracking for Audio-video Speech Processing

We present the first results from applying a recently proposed novel algorithm for the robust and reliable automatic extraction of lip feature points to an audio-video speech data corpus. This corpus comprises 10 native speakers uttering sequences that cover the range of phonemes and visemes in Australian English. The lip-tracking algorithm is based on stereo vision which has the advantage of m...

3D Lip Tracking and Co-inertia Analysis for Improved Robustness of Audio-video Automatic Speech Recognition

Multimodality is a key issue in robust human-computer interaction. The joint use of audio and video speech variables has been shown to improve the performance of automatic speech recognition (ASR) systems. However, robust methods in particular for the real-time extraction of video speech features are still an open research area. This paper addresses the robustness issue of audio-video (AV) ASR s...

The audio-video Australian English speech data corpus AVOZES

This paper presents the Audio-Video Australian English Speech data corpus AVOZES. It contains recordings of 20 speakers uttering a variety of phrases. The corpus was designed for research on the statistical relationship of audio and video speech parameters with an audio-video (AV) automatic speech recognition (ASR) task in mind, but may be useful for other research tasks. AVOZES is the first pu...

Statistical analysis of the relationship between audio and video speech parameters for Australian English

After decades of research, automatic speech processing has become more and more viable in recent years. Audio-video speech recognition has been shown to improve the recognition rate in noise-degraded environments. However, which audio and video speech parameters to choose for an optimal system and how they are related is still an open research issue. Here we present a number of statistical anal...

Publication date: 2004
